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Abstract 

Human genetic variation particularly in Africa is still poorly understood. This is despite a consensus on the large African 
effective population size compared to populations from other continents. Based on sequencing of the mitochondrial 
Cytochrome C Oxidase subunit II (MT-C02), and genome wide microsatellite data we observe evidence suggesting the 
effective size (We) of humans to be larger than the current estimates, with a foci of increased genetic diversity in east Africa, 
and a population size of east Africans being at least 2-6 fold larger than other populations. Both phylogenetic and network 
analysis indicate that east Africans possess more ancestral lineages in comparison to various continental populations 
placing them at the root of the human evolutionary tree. Our results also affirm east Africa as the likely spot from which 
migration towards Asia has taken place. The study reflects the spectacular level of sequence variation within east Africans in 
comparison to the global sample, and appeals for further studies that may contribute towards filling the existing gaps in the 
database. The implication of these data to current genomic research, as well as the need to carry out defined studies of 
human genetic variation that includes more African populations; particularly east Africans is paramount. 
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Introduction 

The fact that Africa is the cradle of modem humans and the 
scene of its major biological and demographic events is not of 
much dispute. This is manifested in various population measures 
including the significantly higher effective size of Africans 
compared to populations from other continents. One feature of 
this is the marked elevated heterozygosity across the genome and 
low LD [1,2]. However both are with inherent limitations in 
reflecting the true extent of genetic variation within a locus or 
entire genome and its genealogy [2]. The first evidence to relate 
the Homo sapiens sapiens origin, exodus and dispersal out of an 
African birth place, was based on the mitochondrial control region 
variation in a global sample of 147 females [3]. This is the 
chronology suggested by classical protein markers [4], and 
subsequently consolidated by genetic data from whole mitochon- 
drial genome sequence [5], Y chromosome [6] and autosomes 
[7,8]. The advent of large scale sequencing made whole 
mitochondrial genome sequencing the method of choice for 
evolutionary studies, gradually replacing the control region 
sequencing which has extreme variation in substitution rate 



among sites, and the consequent parallel mutations [5,9,10]. 
Nevertheless, whole mitochondrial genome sequencing, though 
capable of providing reliable overall phylogeny based on averaged 
variation between sequences, its estimate of dates, population size 
and structure could still potentially be influenced by conflict in 
evolutionary rates between sites and the genetic noise of HVRI 
and HVRII excess mutations [11]. The small non-recombining 
genome of the mitochondria allows a single gene or sequence to 
answer specific evolutionary questions that would reflect patterns 
obtained from the whole sequencing [12]. Although studies 
addressing sequence variation in mtDNA coding region have 
suggested that natural selection has significantly shaped the course 
of human mtDNA evolution [13,14], there is disagreement upon 
whether the distribution of specific human mtDNA haplogroups is 
due to an adaptation to different climates or if their distribution is 
a function of random drift assisted by purifying selection that 
eliminates non-synonymous changes [15]. The significance of a 
sequence displaying selective neutrality feature is its enhanced 
utility in estimating effective population size, one of the challenging 
and controversial undertakings in evolutionary genetics. Whole 
genome microsatellite or single nucleotide polymorphism data sets 
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reflects features of selective neutrality in an averaged manner, as the 
intensities of selective signals vary widely between chromosomes 
and genomic regions. Despite a consensus on a large African 
effective population size as compared to populations from other 
continents, the extent of human genetic variation is still poorly 
understood particularly in east Africa. mtDNA is one of the most 
convenient tools in resolving questions of population size and 
genetic diversity given its extended haplotype structure, non- 
recombining nature and uni-parental inheritance. These features 
combined, apart from the non-recombining portion of the Y 
chromosome, are present nowhere else in the human genome and 
hence make mtDNA variation ideal for screening large data sets. 
Furthermore inference made from the mtDNA hyper-variable 
control region or one of its coding genes are found to be highly 
correlated with whole mitochondrial DNA sequencing thus 
providing an added value for identifying population gaps in 
resource limited settings. In the current study, we explore the level 
of genetic diversity among east Africans utilizing a genome wide set 
of microsatellites and MT-C02 sequences to gain insights into the 
extent to which the east African gene pool has contributed to 
genetic variation at a global scale. 

Results 

Table 1 shows the population parameters and selective 
neutrality test (Tajima's D) based on MT-C02 variation of all 
continental groups, mean values and test of significance for the 
obtained values. Tajima's D (Table 1) scored negative values 
consistent with human expansion within and outside of Africa (or 
exchange of alleles between neighboring demes, see discussion 
below) with satisfactory statistical scores. The transition to 
transversion ratio of 2:1 in our reported Single Nucleotide 
Polymorphisms (SNPs) is consistent with being at the root of the 
gene tree and with neutral evolution distance based analysis using 
FST for mtDNA sequences and RST for microsatellite data were 
carried out for subsequent MDS plotting and a population by 
population correlation comparison using Mantel Test. The result 
showed no correlation with a P value of 0.66 similar to comparison 
between mitochondria and Y chromosome variations reported 
earlier [16]. 

Effective Population Size and Bayesian Skyline Plot (BSP) 
analysis 

Given the intense debate between those advocating the use of a 
phylogenetic mutation rate calibrated by the divergence between 
humans and chimpanzees and those studying the mutation process 
on pedigrees, we attempt to calculate the Ne values using different 
loci and mutation rates to emphasize the relation of the Ne values 
to both parameters. The estimated Ne based on MT-C02 in 
east and other parts of Africa and the rest of the world based on 
three different mutation rates of 9.2xl0~ 7 , 3xl0~ 6 [17] and 
3.15xl0 -7 [14] substitutions per site per generation of 25 years 
gave values that ranged between 6,195 at the lowest to 58,997 at 
the highest for east Africa (Table 2). Ne estimates based on 
microsatellites loci also using three different mutation rates [18,19] 
still gave higher value of 26,734—1,782,277 for east Africans, 
though the fold increase in Ne was lower than that obtained by 
MT-C02 (Table 2). 

As Ne can also differ when taken in different points in time; we 
did a Bayesian analysis enabling the growth rate option in the 
program LAMARC and found effective population growth at 
different generations in the past of the archaic Ne (Table 3). The 
expansion was first observed around 50, 000YBP with a doubling 
in the current effective population size observed at around 15, 
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000YBP. In all cases East Africa maintained a higher effective size 
of at least two to six times that of other populations including other 
Africans. The BSP (Figure 1, and supplementary Figures S5A-C) 
analysis also confirmed the results from LAMARC with the two 
major expansions starting around 50, 000YBP after a long steady 
decline of population size since coalescence time and another 
around 15, 000 YBP, albeit with higher estimates of archaic and 
current Me. Bayesian Skyline plots of individual populations 
(supplementary Figures S5A-C) depict varied population histories 
including time of coalescence and expansion and conform to large 
extent to mismatch plots of single groups. 

Effective Size versus Census Size 

A line graph (Figure 2) plots the log values of both Me based on 
MT-C02 and real and expected census size of east Africans, the 
rest of Africans and other continents [20]. The figure depicts an 
increased effective size in Africa and near the root in east Africa, 
relative to census size, while the trend is reversed in the out of 
Africa groups with exception of Australians among which 
considerable mtDNA diversity seems to have been maintained. 
Fisher's exact test performed to assess the difference between the 
observed and expected census values in Aboriginal Australians did 
not reject the null hypothesis for association (one tailed P= 0.06) 



between the Me and expected census size, usually taken to be ten 
times the effective size [21,22]. 

Mutations and Haplotypes Frequencies in the MT-C02 

The sheer number of haplotypes, a basic measurement of 
genetic diversity, is also taken as an indication of Me. As 
mitochondria are non-recombining the number of mutations 
and haplotypes is quite correlated. In the MT-C02 sequence 68 
haplotypes were estimated using Arlequin ver3.11 and assigned 
numbers from 1 to 68. Haplotype relative and absolute frequencies 
in the studied populations were also calculated. Strikingly, of the 
total 68 haplotypes, 43 occurred solely in east Africa (Table SI) of 
which 25 were in Sudanese, 9 in Eritreans and 5 in Ugandans and 
one Kenyan. The rest of the haplotypes were derived from or 
included east Africans with exception of 13 haplotypes, 4 in Africa 
2 in Australia, 3 in Europe 1 in Arabia 1 America/ Africa and 1 
Europe/ Africa. Of the 42 haplotype defining mutations (Table S2) 
in Sudanese and Eritreans 1 1 (26.2%) were non-synonymous 
(replacements) occurring in trans-membrane domain of COII 
protein while 31 (73.8%) were synonymous with transitions 
representing the majority of the mutations. Out of the 42 
mutations (Table S2), 31 were previously reported in the literature 
and 1 1 were novel. All mutations in Ugandan MT-C02 samples 




0 5 15 25 35 45 55 65 75 85 95 105 



Time (xlOOO years from the present) 

Figure 1. Bayesian Skyline Plot (BSP). BSP based on 543 base pairs of mitochondrial DNA MT-C02 region. The graph was constructed 
merging all populations as global population. The plot displays changes in world female effective population size (Nefj through time, a 25 year 
generation time, and a 9.2 x10~ 7 sub/site/generation mutation rate. Present day is on the left on the x-axis. An increase in world female population 
observed at around 50, 000YBP and around 10, 000-20,000YBP (the periods are highlighted). 
doi:1 0.1 371 /journal.pone.0097674.g001 
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Figure 2. Line Plot. Line plot of the effective size (lower), expected census size (middle line) and actual census size (upper) of the 
continental populations and some east African populations. Each point represents the log of actual values of the census, expected census 
and effective size of these populations. The expected census is obtained by calculating the current We by 10. 
doi:1 0.1 371 /journal.pone.0097674.g002 



are synonymous and reported at http://dspace.nwu.ac.za/ 
handle/ 10394/4221). All published haplogroups associated with 
the mutations are indicated in Table S2. 

Phylogenetic Inference: Median-Joining (MJ) Network 
and Neighbor Joining (NJ) Tree based on MT-C02 
Sequences 

The most ancestral haplotypes as judged by Neanderthal and 
Chimpanzee was a group of sequences that diverged early in 
human history forming a cluster denoted as (A) that includes 1 
individual sequence from the Rift Valley Kenya (Mkamba), in 
addition to Eritreans, Ugandans and Sudanese; another branch 
differentiate encompassing solely east African, predominantly 
Eritreans, Ugandans and Sudanese and denoted as cluster (B). 
The third cluster (C), includes members of almost all world 
populations particularly non-Africans who share a major haplo- 
type that seems to have originated within an east African gene pool 
(Table SI and Figure 3). The starry shape of the major haplotype 
(Figure 3) and values of Tajima's D indicate that the ancestral 
population has undergone at least one episode of expansion prior 
to the dispersal that led to the migration of our species out of 
Africa. 

This was also manifested in a neighbor-joining (NJ) tree of all 
MT-C02 sequences (Figure SI). In agreement with the network 
(Figure 3) the tree shows the root to be unequivocally occupied by 
east Africans (Red dots) and the dispersal of their sequences along 
the tree and at its tip particularly individuals from Eritrea and 
Ethiopia consistent with what had been found when other east 
African populations were analyzed [10]. 

Given the uncertainty in the value of divergence between 
Humans and Chimpanzees reported in the literature, and the wide 
range of dates for the divergence of Humans from Neanderthal 
and Humans MRCA and the differentiation of major mtDNA and 
Y chromosome haplogroups [14,23,24]. We adopted the lower 
threshold of 0.5 million for divergence of our species from 
Neanderthal and its most recent common ancestor to be 
~ 100,000 years and the divergence of the major haplotype in 
Africa to be ~50, 000YBP [25], less than reported elsewhere 
[14,26]. 



Multidimensional Scaling of Microsatellite and MT-C02 
Data 

The difference in population size and divergent history of the 
world populations based on Marshfield genome wide microsatellite 
data set are also depicted in a multi-dimensional scaling (MDS) 
plots. The first plot employs an FST based distance matrix 
(Figure 4A) while the other is a PLINK calculated matrix based on 
probability of identity by state (IBS) for individual samples 
(Figure 4B). Both MDS plots discriminated between Africans 
and non-Africans in their first coordinate, although in the IBS plot 
where the variance measure is more pronounced the drift effect 
and low Me is prominent. In the FST based MDS the 2 nd 
coordinate differentiates between Sudanese and the rest of Africa 
(except San), and between Asians and Europeans. Interestingly 
Beja population from Sudan maintains an intermediate position 
between Africans and non-Africans in both plots. The 3 rd and 4 th 
coordinates of the IBS plot appears to display subsequent 
demographic events where Africans particularly East Africans 
contributed independently to Europeans and Asian ancestry 
(Figures S2 and S3). 

An MDS of the MT-C02 data set (Figure S4) which includes 
additional populations to the microsatellite set, maintains elements 
of the clustering witnessed in other types of analysis in the present 
manuscript and literature, for example the association and central 
contribution of East African populations particularly Nilo-Saharan 
to early demographic events in Human history suggested by the 
first coordinate; the relationship between Nuba and Nubians and 
Moroccans and Ethiopians as shown in the second coordinate 
which appears to express elements of the out of Africa episode (s). 

Discussion 

The tenet of the human population genetic structure is best 
viewed in the light of the inherent diversity within the human 
genome. Humans are believed to have expanded from small 
founder demes somewhere in eastern Africa during the past 
75,000-1 00,000 YBP [27] even though debatable. However, much 
need still exists to unravel the forces behind this variation; as well 
as explain and date the demographic history [28]. A definitive 
answer to questions on population size, and past evolutionary 
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BEJK^ 



Figure 3. Median joining (MJ) Network tree. MJ tree based on MT-C02 gene sequences of world populations. Black points indicate, root; Blue, 
East Africans; Red, Africans; Orange, Australians; Green, Asians; Pink, Americans; Purple, Europeans. Branch lengths are not representative of 
evolutionary distance. The background colors outline possible demographic events in east Africa, and early episodes of evolution possibly towards 
the Rift Valley. 

doi:1 0.1 371 /journal.pone.0097674.g003 



dynamics of African human populations has been hampered, 
hitherto, by substantial gaps in sequence information from African 
population even though it is generally expected that it is the area 
with the most substantial variation existing unlike anywhere else. 
In particular, east Africa has been suggested by the shreds of 
existing data to be pivotal to recent and early human evolution 
[10,29]. In the present study we provide evidence of a significandy 
larger east African population size, based on various genome 
markers, most prominendy sequence data from the mitochondrial 
MT-C02, and autosomal microsatellite genetic variation. 

Historically, the MT-C02 gene has been employed in species 
and phylogenetic inference for the various merits mentioned 
earlier, and inference made from such sequence information are 
found to be highly correlated with large data sets and whole 
genome analysis [5]. Of the few studies done to evaluate the 
validity of using single conserved mtDNA gene as an evolutionary 
tool, MT-C02 was used to estimate divergence time between 
modern humans and chimpanzee depending on class 1 mutations 
and maximum likelihood analysis [30]. Its conserved functional 
nature gives values of selective neutrality attested by a general 
excess of transitions and other relevant genetic parameters. 
Likewise, the bulk of mutations that underlies this striking diversity 



in our data set exhibits similarly features of selective neutrality, 
with excess of transitions and negative values from Tajima's D test, 
hence are unlikely to be explained by a scenario of a selective 
sweep including the major haplotypes cluster that defines 
expansion within Africa and a subsequent out-of-Africa exodus 
event. The confinement of non-synonymous mutations to east 
Africans (mainly Sudanese) is consistent with a position at the base 
of the human evolutionary tree. Based on nested contingency 
analysis of MT-C02 sequences in humans and primates, 
Templeton [31] interpreted on biological grounds that trans- 
membrane replacement mutations seems to behave in a neutral 
fashion. It has been argued recently that even in a scenario of 
natural selection, the observed differences in fitness components 
may likely be evolutionarily neutral and small selective differences 
in life histories will be hard to measure, hence the effects of 
random drift will be amplified in natural populations by the large 
variances among individuals [32] . 

According to the current data East Africa is home to nearly 2/3 
of the world genetic diversity independent of sampling effect. 
Similar figure have been suggested for sub-Saharan Africa 
populations [1]. The antiquity of the east African gene pool could 
be viewed not only from the perspective of the amount of genetic 
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Figure 4. Multidimensional Scaling Plot (MDS). A. First and second coordinates of an MDS plot of 848 Microsatellite Marshfield data set across 
the human genome for 24 populations from Africa, Asia and Europe. MDS plot was constructed from pairwise differences FST generated by Arlequin 
program (Table S3). B. First and second coordinates of an MDS plot of 848 Microsatellite loci, across the human genome in 469 individuals from 24 
populations from Africa, Asia and Europe. MDS uses pairwise IBS data based on the 848 loci generated by PLINK software and plotted using R version 
2.15.0. East Africans cluster to the left of the plot, while Beja (red cluster in the middle), assumes intermediate position. 
doi:1 0.1 371 /journal.pone.0097674.g004 
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diversity endowed within it but also by signals of uni-modal 
distribution in their mitochondrial DNA (Hassan et al, unpub- 
lished) usually taken as an indication of populations that have 
passed through "recent" demographic expansion [33], although in 
this case, may in fact be considered a sign of extended shared 
history of in situ evolution where alleles are exchanged between 
neighboring demes [34]. 

We compare and contrast the MT-C02 sequence data to 
analysis of 848 microsatellites of the Marshfield dataset pardy 
published by Tishkoff et al. [8] and find convincing correlation and 
a persistent pattern where Africans and particularly east Africans 
show greater variance in both distance and Bayesian based 
analysis. Although such large population size is expected to reflect 
on the spectrum of SNPs and haplotype structure in the human 
genome, an in-depth investigation is not currently feasible in the 
absence of major sequencing efforts in east Africa. However, an 
impression based on available sequences of X-chromosome and 
candidate gene analyses indicate that this ratio does persist across 
data for east Africans (unpublished data from the MalariaGen 
consortium and exome sequencing from our own group). 

Our estimate of the population size is more in agreement with 
LD and coalescent dates as in Tenesa et al. [2]. Difference in 
estimate of archaic population size between studies in the range of 
1000-10000 [2,24,35] has been attributed to Ne being estimated 
over different points in time [2] . Emphasis should be laid as well 
on the evolutionary rate of the genetic loci used in the analysis. 
The 10,000 figure suggested by Takahata et al. [24] was based on 
MHC polymorphisms, a locus with one of the highest evolutionary 
rates in the human genome. We base our estimates on two classes 
of loci a mitochondrial gene with an intermediate evolutionary 
rate and, evidence of selective neutrality (MT-C02), and genome 
wide microsatellites, a fast evolving high mutation rate type of loci 
similar to HLA. Although larger Me of east Africans was 
consistently maintained, microsatellites produced values in the 
upper extremities of Ne. The size of difference for microsatellites 
was much narrower possibly due to the fact that higher mutation 
rates may lead to convergence, in addition to admixture, both 
elements that are known to affect the resolution of these markers in 
coalescence estimates. This should also be considered against 
arguments that estimates based on mitochondrial genome, tend to 
be biased towards low inbreeding effective size because of unique 
features pertaining to the mitochondrial genome [31]. Such 
disparities in the estimates of effective population size when using 
different genes and molecular markers [2,24,35] including the 
mitochondrial genome [36] may also account for discrepancies in 
the dating of the age of demographic events based on mitochon- 
drial genome or the human genome in general. The other factor 
that seems to influence the outcome of the analysis most is the 
effective size itself or the sheer number of variants in the target 
genome or loci. The report of larger number of mutations in the 
MT-C02 gene here, has led to different estimate of a mutation 
rate by a factor of 10. Interestingly there seems to be a trend for 
bringing the dates of coalescence more close to the present time 
and the population size smaller [2,37,38]. Based on Bayesian 
analysis of three previously analyzed nuclear loci sequences [38] 
reported Me of 7,000 for long term effective size, in tally with our 
figures. Interestingly the two expansions suggested by the BSP 
coincide with major putative demographic events: one the critical 
biological changes that might have taken place around 50, 
000YBP subsequently leading to the out of Africa expansion; 
and the other is advent of pastoralism and agriculture believed to 
have occurred around 10, 000-20, 000YBP, as the latter date was 
also attested by Y-chromosome [39], autosomal and mitochon- 
drial HVRI re-sequencing analyses [40]. These two demographic 



events are also corroborated by whole mitochondrial sequence 
analysis [41]. 

Previous anthropological and genetic evidence have suggested 
that long-term Me has been about three times larger in African 
populations than in non-Africans [1,35,42]. Simulations studies of 
Africa, Europe and Asia, suggest that the African effective size is 
still the largest of the three regions and is probably at least as great 
as the sum of the Asian and European effective sizes. Our data set, 
however, indicate that east African population size alone may be 
ten times larger than that of Europeans and Asians and three times 
that of other populations including Africans. In fact of 68 MT- 
C02 haplotypes only 14 were unique to populations other than 
east Africans. Re-sequencing of ~8 Mb in 20 independent non- 
coding autosomal region of different continental populations also 
pointed to a five times reduction of effective population size within 
the out of Africa migrants relative to an ancestral African 
population size [43]. The main point to be stressed here is that 
given the observation of east Africa consistently displaying a 
relatively higher long term population size, we conclude that 
human population size might be larger than anticipated. 

A relatively low effective size and the drift like nature of the 
event characterizing the group that made the exodus out of Africa, 
is manifested in the fact that it originates in a spectacularly large 
female source in east Africa. This is in concordance with previous 
data including X chromosome. In a recent report by Keinan et al. 
[7] the authors conclude that a sex-biased process that reduced the 
female effective population size, or an episode of natural selection 
that unusually affected chromosome X, was associated with the 
founding of non-African populations. 

The complex relationship between the census and effective size 
is influenced by both genetic drift and inbreeding. The drift nature 
of the out of Africa event and the subsequent demographic 
expansion is portrayed in the relationship between the census and 
effective size in east Africa and the rest of the world particularly 
Asia and Europe. Both drift and inbreeding are unlikely to have 
influenced east Africans in a substantial way due to the limited 
geographic range where those groups evolved and in-situ evolution 
of its population. 

Fluctuation in population size might be a mark of the out of 
Africa group because migration and challenges of adapting to new 
environments subject the population to both influences of drift and 
inbreeding. Cases of low census size and a larger inbreeding 
effective size are known in mammalian populations and attributed 
to recent population reductions [44]. Although the difference 
between the current and expected census for Australians was not 
statistically significant it still indicates an interesting feature of this 
isolated group. It is not clear why Australia was colonized with a 
higher population size than the populations that colonized other 
regions. Henn et al., [45] contemplated this in the light of lineage 
specific acceleration. Our findings, however, indicate that the 
population of Australia may have maintained a legacy of high Me 
originally carried by the ancestral group that left Africa and seen in 
the number of haplotypes that survived in their gene pool. This 
may suggest that both census and effective size of the group that 
made it to Australia was large enough to counteract the effect of 
drift and permit survival of relics of these original haplotypes. 

It is not only genetic data that lends support to an east African 
origin of humans but the unparalleled ethnic and linguistic 
diversity that remains one of the highest worldwide. Interestingly 
the two most ancestral sequences in the NJ tree figure refer to 
Nubian individuals. Nubia is currendy identified with one of the 
most ancient human settiements, the Say culture. Recently, a 
related compound associated with a lithic middle Stone Age 
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industry was discovered in Dhofar Oman and taken as an evidence 
of human migration out of Africa through an Arabian route [46]. 

Overall, the various genetic markers used in the current analysis 
support the observation of human effective population size larger 
than previously estimated, and emphasize the importance of 
sampling populations of putative deep ancestry. 

Materials and Methods 

Ethical statement 

Ethical approval has been obtained from regulatory bodies in 
Sudan (Ethics Review committee (ERC) Institute of Endemic 
Diseases, University of Khartoum: http:/ /www.healthresearchweb. 
org/br/sudan/ethics; Reference: 28/10/03etiend) and Eritrea 
(The Ethical Review Board of Eritrea Institute of Technology, 
Eritrea; Reference: June 21, 2011) for the sample sets analyzed in 
the Khartoum laboratory. Other data has been obtained from 
online sources. All past and present research adopts good practices 
in ethics and local and international guidelines for genetic research. 
This includes obtaining of written informed consents from all 
participants, keeping the data anonymity from the beginning and 
communicating research results to populations concerned. 

Data sets, Sequences and alignment 

Data sets used in this study include genotype information on 
848 nuclear microsatellites obtained from individuals of various 
world populations including Sudanese, Ethiopians and Kenyans 
available at http://www.sciencemag.org/content/324/5930/ 
1035/suppl/DCl and MT-C02 DNA sequences comprising 23 
Ugandans obtained from http://dspace.nwu.ac.za/handle/ 
10394/4221), 75 other Africans, 20 Asians, 18 Australians, 45 
Europeans and 20 Americans making a total of 180 sequences 
obtained from public mtDB database (http:/ /www.mtdb.igp.uu. 
se/) in addition to 81 Eritreans and 46 Sudanese sampled and 
genotyped following appropriate ethical consent and documenta- 
tions. PGR amplification of MT-C02 gene for the Eritreans and 
Sudanese samples was carried out at the Institute of Endemic 
Diseases, University of Khartoum and outsourced for commercial 
sequencing at BGI Hong Kong, China. MT-C02 sequences were 
identified by blasting against Yoruba reference sequence (accession 
number AF347015) and sequences with mutations were deposited 
at GenBank with accession number KC753688-KC753760. 

A total of 330 MT-C02 sequences were aligned using the 
software program BioEdit [47]. Yoruba reference sequence 
(GenBank accession number AF347015) was used as a reference 
sequence for the alignment. Polymorphic positions were visually 
noted and scored. For Eritrean and Sudanese populations, 
mutations were confirmed in forward and reverse electrophero- 
grams. Synonymous, non-synanymous and novel mutations are 
indicated in Table SI. Arlequin ver3.1 1 [48] was used to calculate 
the relative and absolute haplotype frequencies among study 
group sequences in addition to the shared haplotypes (Table S2). 

Statistical analyses 

Arlequin suite ver3. 1 1 [48] was used to provide information on 
genetic and demographic features of the studied populations. 
Haplotypes frequencies, pairwise FSTs, in addition Tajima's D test 
to check for selective neutrality were calculated for the studied 
populations. Fisher's exact test was used to test the probability of 
association between the expected census size being ten times the 
effective size and observed census size for current population of 
aboriginal Australians. The correlation between the two markers 
was evaluated by Mantel Test within Arlequin program. 



Populations' structures and affinities 

MDS (Figure 4A) was plotted from FST intermediate values 
(Table S3), generated from Arlequin version 3. 1 using Marshfield 
microsatellite allele frequency data, and employing cluster analysis 
option within the program PAST (Paleontological Statistical Tool 
program [49]. Versus the distance approach we also employ a 
Bayesian approach using a device suggested by Cavalli-Sforza and 
implemented by Reich and Patterson [50] to transform the 
microsatellite data into identity by state (IBS) Each allele is 
considered as a separate marker and reported as having one copy, 
two copies or no copies. The data thus comes to resemble bi-allelic 
data. The data was pruned by removing all markers with more 
than 10% missing data, then analyzed and plotted (Figure 4B) 
using PLINK program [51] and graph drawn with R version 
2.15.0. 

Phylogenetic trees and network 

Phylogenetic tree was constructed for the MT-G02 sequences 
using Neighbor-Joining (NJ) method. The NJ tree was calculated 
in MEGA5 [52]. Both Chimpanzee (GenBank accession 
NoD38113) and Neanderthal (EMBL accession number 
AM948965) sequences were used as out groups. The aligned 
and refined MT-C02 sequences were used to construct a Median 
Joining (MJ) Network using Network 4.6.1.1. (http://www.fluxus- 
engineering.com) as outlined by Bandelt et al. [53]. Dating was 
carried out using a coalescent based analysis and SkyLine Plot 
Bayesian inference, with options available in the programs 
Network [53] and BEAST [54]. The mutation rate was estimated 
at 0.5-1.0 nucleotide/site/ 100.000 years referring to published 
data on dates of divergence between humans and chimpanzee. 

Effective population size 

The effective population size (Me) was calculated using an MLE 
(maximum likelihood estimate) of theta value generated by 
LAMARC version 2.1.8 [55]. The method uses a coalescence 
approach to estimate population parameters by sampling random 
genealogies of sequences (or alleles) to calculate the parameters 
and Metropolis Monte Carlo sampling technique to concentrate 
the sampling in regions which contribute to the final result. The 
log value of the Me and the census size was used to plot the chart. 
Current and archaic Me values (Table 3) were also calculated from 
current and archaic theta values after generating growth rate and 
Bayesian estimation of current theta using LAMARC program. 
Archaic theta was calculated using the formula theta (t) = theta 
(now)EXP(-gtu). 

Bayesian Skyline Plot analysis 

In order to assess past changes in female effective population 
size (Me) further, we used the BEAST vl.8.0 [54] package to 
analyze timing and magnitude of past changes in population size. 
We analyzed the datasets under the HKY model putting the clock 
model as relaxed uncorrelated clock with the mean mutation rate 
of 9.2 xlO -7 substitutions per site per generation as used in 
LAMARC. We used a Bayesian Skyline coalescent tree prior with 
1 0 groups under a piecewise-constant model. The analysis was run 
for 10 million generations with parameters logged every 1000 
generations, and Tracer 1.6 (http://tree.bio.ed.ac.uk/software/ 
tracer/) was used to inspect chain convergence and conduct the 
skyline reconstruction (Figure 1). BSP analysis (supplementary 
Figures S5A-C) was also conducted for each individual populations 
for comparison purpose. 
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Supporting Information 

Figure SI Neighbor joining (NJ). NJ tree of the world 
populations based on MT-C02 sequences. The evolutionary 
relationship of 171 sequences and evolutionary history was 
inferred using the Neighbor-Joining method. The optimal tree 
with the sum of branch length = 0.20401570 is shown. The 
evolutionary distances were computed using the Maximum 
Composite Likelihood method and are in the units of the number 
of base substitutions per site. Codon positions included were 
1 st+2nd+3rd+Noncoding. All positions containing gaps and 
missing data were eliminated from the dataset. There were a 
total of 543 positions in the final dataset. Phylogenetic analyses 
were conducted in MEGA4. Red dots: east Africa, Blue: Africa, 
Green: Asia, Yellow: Australia, Pink: Europe and gray: America. 
(TIF) 

Figure S2 Multidimensional Scaling Plot (MDS). The 2nd 

and 3rd coordinates of an MDS plot of 848 nuclear microsatellite 
loci from 469 individuals of 24 world populations. MDS uses 
pairwise IBS data based on the 848 loci generated by PLINK 
software and plotted using R version 2.15.0. The figure, besides a 
separate clustering of east Africans, indicates the substantial 
contribution of Africans and east Africans to the founding of 
populations of Europe and Asia. 
(TIF) 

Figure S3 Multidimensional Scaling Plot (MDS). The 3rd 

and 4th coordinates of an MDS plot of 848 Microsatellite loci, 
across the human genome in 469 individuals from 24 populations 
from Africa, Asia and Europe. MDS uses pairwise IBS data based 
on the 848 loci generated by PLINK software and plotted using R 
version 2.15.0. The central position of east Africans and some 
other Africans emphasizes the founding role of east African gene 
pool and the disparate alignment on coordinates along which the 
world populations were founded including populations of Africa 
aligning along the 4th dimension. 
(TIF) 

Figure S4 Multidimensional Scaling Plot (MDS). First 
and second coordinates of an MDS plot based on MT-C02 data 
set constructed from pairwise differences FST generated by 
Arlequin v3.11. Population code as follows: Nara: Nar, Kunama 
(Kun), Hidarb (Hid), Afar (Afa), Saho (Sah), Bilen (Bil), Tigre 
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